Automatically Extracting Ontologically Specified Data from HTML Tables of Unknown Structure
Authors
Abstract
Data on the Web in HTML tables is mostly structured, but we usually do not know the structure in advance. Thus, we cannot directly query for data of interest. We propose a solution to this problem based on document-independent extraction ontologies. The solution entails elements of table understanding, data integration, and wrapper creation. Table understanding allows us to recognize attributes and values, pair attributes with values, and form records. Data-integration techniques allow us to match source records with a target schema. Ontologically specified wrappers allow us to extract data from source records into a target schema. Experimental results show that we can successfully map data of interest from source HTML tables with unknown structure to a given target database schema. We can thus “directly” query source data with unknown structure through a known target schema.
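To make the abstract's pipeline concrete, here is a minimal Python sketch of recognizing attributes and values in a table, pairing them into records, and mapping those records onto a target schema. It is an illustration under stated assumptions, not the authors' system: it assumes the first table row holds the attribute labels, and the hand-written synonym sets in the hypothetical `TARGET_SCHEMA` stand in for an extraction ontology. The names `TableRows`, `TARGET_SCHEMA`, `extract_records`, and `SAMPLE` are all invented for this example.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical target schema: each target attribute lists the source
# labels it accepts. A real extraction ontology would be far richer.
TARGET_SCHEMA = {
    "make": {"make", "brand"},
    "year": {"year", "yr"},
    "price": {"price", "cost"},
}


def extract_records(html):
    """Map table rows to records over the target schema."""
    parser = TableRows()
    parser.feed(html)
    if not parser.rows:
        return []
    # Assumption: the first row holds the attribute labels.
    header, *body = parser.rows
    # Match each source column to a target attribute, if any.
    columns = {}
    for index, label in enumerate(header):
        for target, synonyms in TARGET_SCHEMA.items():
            if label.lower() in synonyms:
                columns[index] = target
    # Pair attributes with values; unmatched columns are dropped.
    return [
        {columns[i]: value for i, value in enumerate(row) if i in columns}
        for row in body
    ]


SAMPLE = """
<table>
  <tr><th>Brand</th><th>Yr</th><th>Cost</th><th>Color</th></tr>
  <tr><td>Honda</td><td>1998</td><td>$4,500</td><td>red</td></tr>
  <tr><td>Ford</td><td>2001</td><td>$7,200</td><td>blue</td></tr>
</table>
"""

records = extract_records(SAMPLE)
```

In this sketch the source column "Color" has no counterpart in the target schema, so it is left out of the resulting records. Tables whose attributes run down the first column, or whose headers span several rows, would need the table-understanding step that the abstract describes and that this example does not attempt.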
Similar resources
Automating the extraction of data from HTML tables with unknown structure
Data on the Web in HTML tables is mostly structured, but we usually do not know the structure in advance. Thus, we cannot directly query for data of interest. We propose a solution to this problem based on document-independent extraction ontologies. Our solution entails elements of table understanding, data integration, and wrapper creation. Table understanding allows us to find tables of inter...
A Semantic Approach to Internet Tabular Information Extraction
Extracting information from tables is essential for Internet information extraction. However, most web tables are designed in HTML format. To decipher their semantic meanings, a system needs to deal with various layouts, which is quite cumbersome. Previous work follows two major approaches: the layout enumeration approach and the wrapper approach. The first approach is to match the table with presorted la...
Extracting Attributes and Their Values from Web Pages
We propose a method for extracting attributes and their values from Web pages. Our method makes use of word distributions estimated from plain Web pages. The key idea is to estimate word distribution by consulting ontologies built from HTML tables. In a series of experiments, we show that estimated word distributions are useful for extracting attributes and their values in various kinds of HTML...
Capturing Semantic Hierarchies to Perform Meaningful Integration in HTML Tables
We present a new approach that automatically captures the semantic hierarchies in HTML tables and semi-automatically integrates HTML tables belonging to a domain. It first automatically captures the attribute-value pairs in HTML tables through normalization and heading recognition. After a global schema is generated manually, it learns the lexical semantic sets and contexts, by which it then eli...
HTML Table Interpretation by Sibling Page Comparison in the Molecular Biology Domain
A large and growing amount of biological data resides in different online repositories. Many of these repositories represent their data in tables. To understand these online pages automatically, a system that can interpret tables is needed. However, the longstanding problem of automatic table interpretation still eludes us [12]. We offer a solution for the common special ...
Publication date: 2002